The Essentials of R Programming
Charlotte Fresenius Privatuniversität
March 21, 2025
To create a new R script in RStudio, go to File > New File > R script.
To execute code, press Ctrl + Enter (or Cmd + Enter on Mac).
First, let’s use R as a calculator. We can write a calculation into our R script, and R will give us the result when the code is executed:
We can also perform comparisons using R:
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
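The code behind the six results above is not shown here. As a sketch, one set of comparisons that yields exactly this sequence is the following (the numbers and the name `comparisons` are made up for illustration):

```r
# Relational operators return a logical value (TRUE or FALSE)
comparisons <- c(
  5 > 3,    # greater than
  5 < 3,    # less than
  5 >= 5,   # greater than or equal to
  5 <= 4,   # less than or equal to
  5 == 3,   # equal to (note the double equals sign)
  5 != 3    # not equal to
)
comparisons
```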
R comes with many functions that you can use to perform tasks from simple to sophisticated. Functions have inputs (aka arguments) that you pass into them and outputs (aka return values) that they give back. Functions are fundamental to how R works.
Let’s see an example. Say, we want to round the number 5.293586 to two decimal digits. Fortunately, there is a function in R called round. But how do we use it?
To find out, we can bring up the R help, which provides documentation for every function in R, by typing ? and the name of the function into the console:
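As a small sketch of how this could look: the help call is left as a comment because it is meant for interactive use in the console, and the variable name `rounded` is only an illustration.

```r
# In the console, this opens the documentation for round:
# ?round

# round takes the number and the number of decimal digits to keep
rounded <- round(5.293586, digits = 2)
rounded   # 5.29
```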
Generally, we can use the args function to see the arguments that a function takes. For example, let’s say, we wanted to simulate someone throwing a regular die, i.e. randomly sample numbers from 1 to 6. Luckily, there is an R function called sample. Let’s inspect it.
So, to simulate 20 throws of a regular die, we have to use the function like this:
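A minimal sketch of both steps, inspecting the arguments and then sampling. The call to `set.seed` and the name `throws` are assumptions added here so that the example is reproducible; they are not required to use sample.

```r
# Show the arguments of sample: x, size, replace and prob
args(sample)

set.seed(42)  # assumption: fix the random seed for reproducibility

# 20 throws of a die: draw from 1 to 6 with replacement,
# because the same number can come up more than once
throws <- sample(x = 1:6, size = 20, replace = TRUE)
throws
```

Without `replace = TRUE`, the call would fail, since we cannot draw 20 distinct values from only six numbers.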
Note that we can pass the arguments to the function in more than one way, e.g. by position or by name.
Let’s see another example of this.
Let’s say, we wanted to compute the base 2 logarithm of 10. For this, we need the log function in R:
So, based on the three options given on the previous slide, all of the following three ways of calling log lead to the correct result:
However, options 1 and 3 are preferable, as switching the order of arguments tends to make code less easily readable for others (which could be yourself in the future…)
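As a sketch, the three call styles could look as follows. The variable names are only for illustration, and the order in which the calls are listed here is not meant to match the option numbering.

```r
# log has the arguments x and base (in this order)
by_position    <- log(10, 2)             # arguments matched by position
by_name        <- log(x = 10, base = 2)  # arguments matched by name
by_name_switch <- log(base = 2, x = 10)  # by name, order switched

by_position   # approximately 3.32 in all three cases
```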
Let’s say, you were given the task to solve the quadratic equation
\[3x^2 - 5x - 1 = 0\]
You will remember from back in your high school days that we can use the “midnight formula” for this:
\[x_{1/2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\]
With R, we can easily compute this of course:
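A sketch of that computation with the coefficients typed in directly (a = 3, b = -5, c = -1). The names `x_1` and `x_2` are chosen here for illustration.

```r
# Solutions of 3x^2 - 5x - 1 = 0 via the "midnight formula"
x_1 <- (-(-5) + sqrt((-5)^2 - 4 * 3 * (-1))) / (2 * 3)
x_2 <- (-(-5) - sqrt((-5)^2 - 4 * 3 * (-1))) / (2 * 3)

x_1   # approximately 1.847
x_2   # approximately -0.180
```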
Next, we are given another quadratic equation
\[4x^2 - 8x + 2 = 0\]
To compute the solution with R, we would have to replace every occurrence of \(a\), \(b\) and \(c\), so in total, we would have to make 10 replacements to get both solutions. That’s too cumbersome and error-prone…
Instead, we can define variables \(a\), \(b\) and \(c\) and simply assign different values to them every time we have to solve a quadratic equation. Such assignments happen in R with the help of the assignment operator <- (read it as “gets”):
Now, we can write expressions using these variables like we would in maths.
Now, to compute the solution to the second quadratic equation, we simply re-define the variables a, b and c and then evaluate the same expression again:
[1] 1.707107
[1] 0.2928932
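As a sketch, the assignments and expressions for the second equation could be written like this. The names `x_1` and `x_2` are added here for illustration; the values agree with the two results printed above.

```r
# Coefficients of 4x^2 - 8x + 2 = 0
a <- 4
b <- -8
c <- 2

# The same expressions work for any values of a, b and c
x_1 <- (-b + sqrt(b^2 - 4 * a * c)) / (2 * a)
x_2 <- (-b - sqrt(b^2 - 4 * a * c)) / (2 * a)

x_1   # 1.707107
x_2   # 0.2928932
```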
With assignments, we are creating objects in R that are saved and can be referenced by the names that we give them (e.g. a, b and c). Creating objects like this will make them appear in the work space in pane 4 of the RStudio window:
We can also see all variables currently defined in the work space by typing ls():
With the assignment operator, we can also define our own functions of course! To define a function, we need a function name, arguments, a function body and a return value:
Consider the following example: say we want to write a function called quadratic_solver (function name) that gives us the solution to any quadratic equation. It needs input arguments a, b and c and should return the two solutions. In the function body, the two solutions should be computed for the three inputs. So, we could create the function as follows:
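A minimal sketch of what such a function could look like. The internal names `sol_1` and `sol_2` and the variable `roots` are assumptions made for this example.

```r
# Function name: quadratic_solver; arguments: a, b, c
quadratic_solver <- function(a, b, c) {
  # Function body: compute both solutions with the "midnight formula"
  sol_1 <- (-b + sqrt(b^2 - 4 * a * c)) / (2 * a)
  sol_2 <- (-b - sqrt(b^2 - 4 * a * c)) / (2 * a)
  # Return value: a vector holding both solutions
  return(c(sol_1, sol_2))
}

# Solve 4x^2 - 8x + 2 = 0
roots <- quadratic_solver(a = 4, b = -8, c = 2)
roots
```

Note that this version does not check the sign of the discriminant, so a negative discriminant leads to NaN values and a warning from sqrt.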
In R, the following six data types are available: double, integer, character, logical (TRUE or FALSE), complex and raw.
We can find the type of an object using the function typeof. We can verify whether an object is of a certain <type> by using the function is.<type>. Let’s see some examples.
By default, R will save any number that you type in as a double.
[1] "double"
[1] "double"
[1] TRUE
[1] TRUE
Together with integer, the data type double is one of the two numeric types, i.e. representing numbers:
Integers (whole numbers) are positive or negative numbers that can be written without a decimal component. This data type matters mostly to developers, as integers take up less memory than doubles. To create an integer rather than a double, the number has to be followed by an uppercase L:
(Pure) integers are also numeric of course:
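A short sketch of the two numeric types side by side (the variable names are only for illustration):

```r
x <- 5    # no suffix: stored as a double
y <- 5L   # suffix L: stored as an integer

typeof(x)       # "double"
typeof(y)       # "integer"
is.integer(x)   # FALSE
is.integer(y)   # TRUE

# Both types count as numeric
is.numeric(x)   # TRUE
is.numeric(y)   # TRUE
```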
Text data is represented in R with the help of the character data type. To demarcate a string of characters, you can use double or single quotes ("" or '').
[1] "character"
[1] "character"
[1] TRUE
[1] TRUE
Of course, character objects are not numeric.
When working with relational operators, we already saw a couple of instances of the logical data type. It can only take the values TRUE and FALSE. Negation of a logical object (i.e. saying NOT) can be achieved with the help of the ! operator:
Logical objects are also not numeric.
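As a brief sketch, the checks for character and logical objects could look like this (the names `s` and `flag` are made up for the example):

```r
s <- "hello"
typeof(s)                      # "character"
is.character('single quotes')  # TRUE, single quotes work too
is.numeric(s)                  # FALSE

flag <- TRUE
typeof(flag)       # "logical"
!flag              # FALSE, negation with the ! operator
is.numeric(flag)   # FALSE
```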
We can coerce an object to be of a certain <type> by using the function as.<type>. This process is called type coercion.
In some cases, this is very intuitive…
[1] 2.71828
[1] 10
[1] 10
[1] "42"
Nonsensical attempts at coercion result in NA (not available).
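A sketch of a few coercions. The inputs are chosen for illustration and are not necessarily the ones behind the output above. `suppressWarnings` is used only to keep the example quiet; without it, R warns that NAs were introduced by coercion.

```r
num  <- as.numeric("2.71828")   # character to double: 2.71828
int  <- as.integer("10")        # character to integer: 10
chr  <- as.character(42)        # double to character: "42"
one  <- as.numeric(TRUE)        # logical to double: 1

# No sensible number corresponds to this string, so the result is NA
bad <- suppressWarnings(as.numeric("forty-two"))
bad   # NA
```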
NA is one of four special values. While NA can occur for any data type, the other three can only occur for numeric data types. These are:
Inf: positive infinity
-Inf: negative infinity
NaN: not a number
While they are not technically numbers, these special values still follow logical rules when mathematical operations are applied to them.
Unlike NaN, NAs are genuinely unknown values. Nevertheless, they also have their logical rules. Consider the following example. Let’s think about why each of the following four results makes sense:
Note: all special values have their own is.<special> function to check for them, i.e. is.na, is.nan and is.infinite.
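A sketch of how the special values behave. The expressions are chosen for illustration and need not be the same ones discussed above.

```r
1 / 0        # Inf
-1 / 0       # -Inf
0 / 0        # NaN
Inf - Inf    # NaN
Inf + 1      # Inf

# NA stands for an unknown value, so most results are unknown too ...
NA + 1       # NA
NA > 1       # NA
# ... unless the result is the same whatever the unknown value is
NA & FALSE   # FALSE
NA | TRUE    # TRUE

# Checking for special values
is.na(NA)          # TRUE
is.nan(0 / 0)      # TRUE
is.infinite(-Inf)  # TRUE
```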
The data structures covered here are: vector, list, matrix, data.frame, factor and Date.
An atomic vector is a simple vector of values of one data type. Values can be concatenated together into a vector using the c function. For example:
Since we have filled the vector x with values of type double, it will itself also be of type double. It has several attributes / characteristics such as length:
What happens if we try to create a vector with data of different types?
When we attempt to combine different types, R will coerce the data in a fixed order, namely character \(\to\) double \(\to\) integer \(\to\) logical, i.e. if any data of a higher-order type appears in the vector creation, the vector will be of that type:
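A small sketch of vector creation and of the coercion order (all values are made up for illustration):

```r
x <- c(1.5, 2, 3.7)
typeof(x)   # "double"
length(x)   # 3

# Mixing types: the result takes the highest-order type present
typeof(c(TRUE, 2L))       # "integer"
typeof(c(1L, 2.5))        # "double"
typeof(c(1, "a", TRUE))   # "character"

mixed <- c(1, "a", TRUE)
mixed   # "1" "a" "TRUE"
```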
Most operations in R are vectorized, which means they are (automatically) performed element by element:
This also applies when we want to add / subtract / multiply / divide two vectors element-wise:
If two vectors are of different lengths, R recycles the shorter one to allow operations like the following:
However, beware of recycling:
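A sketch of vectorized operations and recycling with made-up values:

```r
x <- c(1, 2, 3, 4)

doubled <- x * 2                     # element by element: 2 4 6 8
summed  <- x + c(10, 20, 30, 40)     # element-wise sum: 11 22 33 44

# Recycling: c(10, 20) is silently reused as 10, 20, 10, 20
recycled <- x + c(10, 20)            # 11 22 13 24

# If the longer length is not a multiple of the shorter one,
# e.g. x + c(10, 20, 30), R still recycles but issues a warning.
```

The silent case is the risky one: if the second vector was meant to have four elements, the mistake goes unnoticed.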
To create vectors that are sequences, there is a very useful R function called seq:
[1] 1 2 3 4 5 6 7 8 9 10
[1] 2 4 6 8 10 12 14 16 18 20
[1] 1.2 1.1 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2
Sequences of consecutive integers – like the first one we saw – are particularly frequently needed in (R) programming. For this reason, there is a short-hand notation (“syntactic sugar”) to create sequences of that kind, which is start_value:end_value:
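A sketch of calls that would be consistent with the three sequences shown above, followed by the short-hand notation (the variable names are for illustration only):

```r
s1 <- seq(from = 1, to = 10)                # 1 2 ... 10
s2 <- seq(from = 2, to = 20, by = 2)        # 2 4 ... 20
s3 <- seq(from = 1.2, to = 0.2, by = -0.1)  # 1.2 1.1 ... 0.2

# Short-hand for consecutive integers
short <- 1:10
short

# The short-hand also counts downwards
10:1
```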
If we want to extract or replace elements of a vector, we can use square brackets [] with logical or numeric input:
alphabet <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n",
              "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z")
# Numeric subsetting
alphabet[1:3] # first three
[1] "a" "b" "c"
[1] "c" "b" "a"
[1] "b" "d"
[1] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE
[1] "E"
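Only part of the code for this output is shown. As a sketch, calls along the following lines would be consistent with the results above; the exact expressions are an assumption, and the built-in constant `letters` is used here as a shortcut for the lowercase alphabet.

```r
alphabet <- letters   # built-in vector "a" to "z"

# Numeric subsetting
alphabet[1:3]         # first three
alphabet[3:1]         # first three in reverse order
alphabet[c(2, 4)]     # second and fourth element
alphabet[-(1:10)]     # everything except the first ten

# Logical subsetting
is_e <- alphabet == "e"   # TRUE only at position 5
alphabet[is_e]            # "e"

# Replacing elements
alphabet[is_e] <- "E"
alphabet[5]               # "E"
```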
We can also use the which function to turn a logical vector into an integer vector that gives the indices of all elements in the vector that are TRUE:
With the %in% operator, we can check whether elements of the vector are contained in another vector:
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[25] TRUE TRUE
[1] 24 25 26
[1] "x" "y" "z"
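A sketch that would be consistent with the three results above. The name `in_xyz` is made up, and `letters` again stands in for the alphabet vector.

```r
alphabet <- letters

# Which elements of alphabet are contained in c("x", "y", "z")?
in_xyz <- alphabet %in% c("x", "y", "z")
in_xyz             # logical vector, TRUE for the last three elements

which(in_xyz)      # indices of the TRUE elements: 24 25 26
alphabet[in_xyz]   # "x" "y" "z"
```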
For sorting vectors, we can use the sort function. This works with character vectors:
[1] "d" "E" "m" "y" "t" "b" "h" "c" "a" "j" "v" "n" "k" "f" "q" "p" "z" "l" "x"
[20] "u" "w" "i" "r" "g" "s" "o"
[1] "a" "b" "c" "d" "E" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
… as well as with numeric vectors:
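A short sketch with made-up numbers, including the `decreasing` argument of sort:

```r
nums <- c(3, 10, 1, 7)

ascending  <- sort(nums)                      # 1 3 7 10
descending <- sort(nums, decreasing = TRUE)   # 10 7 3 1

ascending
descending
```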
Lists are a step up in complexity from atomic vectors: each element can be any type, not just vectors. We construct lists with the function list:
[[1]]
[1] 1 2 3
[[2]]
[1] "R"
[[3]]
[1] TRUE FALSE TRUE
[[4]]
[1] 2.3 5.9
[1] "list"
Lists are sometimes called recursive vectors because a list can contain other lists. This makes them fundamentally different from atomic vectors.
The elements of a list can also have names. These can be accessed with the help of the $ operator. Alternatively, to access a single element of a list, we can also use double square brackets [[]]:
For subsetting lists, essentially the same rules apply as for atomic vectors, i.e. we can subset using numeric or logical arguments and single square brackets []:
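A sketch of list construction and access. The definition of `l1` is inferred from the printed list above, and the named list `person` is a made-up example.

```r
# An unnamed list with elements of different types and lengths
l1 <- list(1:3, "R", c(TRUE, FALSE, TRUE), c(2.3, 5.9))

# A named list
person <- list(name = "Anna", height = 176)
person$name          # "Anna"
person[["height"]]   # 176

# Double brackets return the element itself ...
l1[[1]]              # the integer vector 1 2 3
# ... single brackets return a (shorter) list
l1[1:2]
l1[c(TRUE, FALSE, FALSE, TRUE)]
```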
One way to construct more complex data structures on top of elementary building blocks like vectors or lists is to assign a class to them. A class is metadata about the object that can determine how common functions operate on that object.
Probably the easiest example of this is a matrix. A matrix is just a two-dimensional array of numbers:
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
The function matrix fills the matrix column by column by default. There is an option to fill it by row instead, see ?matrix.
Technically speaking, in R, a matrix is actually just a vector with an attribute that specifies the dimensions (i.e. number of rows and columns) of the matrix. We can access the dimensions of a matrix with the help of the functions dim, nrow and ncol:
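A sketch of creating the 3 x 4 matrix shown earlier and inspecting its dimensions (the names `m` and `m_by_row` are assumptions):

```r
# Fill the numbers 1 to 12 into three rows, column by column
m <- matrix(1:12, nrow = 3)

dim(m)          # 3 4
nrow(m)         # 3
ncol(m)         # 4
attributes(m)   # the dim attribute is what makes this vector a matrix

# Fill row by row instead
m_by_row <- matrix(1:12, nrow = 3, byrow = TRUE)
m_by_row[1, ]   # 1 2 3 4
```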
We can add additional rows or columns with the help of rbind or cbind:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 1 1 1
[2,] 1 4 7 10 13
[3,] 2 5 8 11 14
[4,] 3 6 9 12 15
As a matrix is a two-dimensional object, we need two indices separated by a comma for subsetting. If we do not specify one dimension, all elements across that dimension are selected:
A B C D E
1 4 7 10 13
row_1 row_2 row_3 row_4
1 7 8 9
[1] 11
A B C D E
3 6 9 12 15
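The code for these steps is not shown. As a sketch, the following calls would be consistent with the output above; the row and column names are taken from the printed results, everything else is an assumption.

```r
m <- matrix(1:12, nrow = 3)

m <- cbind(m, 13:15)   # add a fifth column
m <- rbind(1, m)       # add a first row (the single 1 is recycled)

# Name the columns and rows
colnames(m) <- c("A", "B", "C", "D", "E")
rownames(m) <- paste0("row_", 1:4)

m[2, ]      # second row, all columns: 1 4 7 10 13
m[, "C"]    # all rows of column C: 1 7 8 9
m[3, 4]     # single element: 11
m[4, ]      # fourth row: 3 6 9 12 15
```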
As a matrix is basically just a vector underneath, all values still need to be of the same data type. However, real tabular data often includes data of different types (e.g. name, height, education). To represent this, we need a data structure where the columns can be of different types.
In R, such a data structure is provided by the data.frame. Underlying it is a named list, whose elements represent columns. Therefore, all elements need to have the same length. Consider the following example:
Let’s inspect the “list nature” of the data.frame object df we just created:
Even though df is fundamentally a list, it behaves differently than a standard list:
heights names educ
1 176 Anna BSc
2 178 Jakob MA
3 156 Lisa PhD
$heights
[1] 176 178 156
$names
[1] "Anna" "Jakob" "Lisa"
$educ
[1] "BSc" "MA" "PhD"
This is because, on top of being a list, the object df is of class data.frame, which changes how common functions operate on it.
Since a data.frame also represents two-dimensional data, many of the functions we used for matrices work on it as well:
However, as data.frames are lists underneath, we can also use the $ operator to access the elements of the list as before.
This is a particularly common action when working with real data, as it allows us to access the variables in the columns of the data.frame. For example, let’s say we made a mistake and found out that Jakob was really 187 cm tall. We could change the corresponding data point as follows:
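A sketch of creating the data.frame and correcting Jakob's height. The values are taken from the printed output above; the exact calls are an assumption.

```r
df <- data.frame(
  heights = c(176, 178, 156),
  names   = c("Anna", "Jakob", "Lisa"),
  educ    = c("BSc", "MA", "PhD")
)

typeof(df)   # "list"
class(df)    # "data.frame"
dim(df)      # 3 3

# Access a column with $
df$heights   # 176 178 156

# Replace the height in the row where names is "Jakob"
df$heights[df$names == "Jakob"] <- 187
df$heights   # 176 187 156
```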
A factor is R’s way to represent categorical data. Say, for example, we want to add the sex of the three people in our data set to the data.frame. For this, we create a factor and add it as an additional column to df:
heights names educ sex
1 176 Anna BSc f
2 187 Jakob MA m
3 156 Lisa PhD f
A factor has levels that represent the categories that this variable can take (here: “m” for male and “f” for female). In the background, R stores these levels as integers and keeps a map to keep track of the labels. This is more memory efficient than storing all the characters.
A factor is another example of a data structure that is built on top of an atomic vector using a class attribute. The data is stored as an integer vector (data type), but because the object is of class factor, R uses different methods to act on the object (compared to a standard integer vector). This type of behaviour is one of the things that makes R very powerful for data science.
[1] "integer"
[1] 1 2 1
[1] "factor"
[1] f m f
Levels: f m
As we will see, properly encoding categorical data as factors is an essential step in data preparation.
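A sketch of creating and inspecting such a factor, shown here on its own rather than as a column of df:

```r
sex <- factor(c("f", "m", "f"))

sex               # prints the values and "Levels: f m"
levels(sex)       # "f" "m"
class(sex)        # "factor"
typeof(sex)       # "integer"
as.integer(sex)   # 1 2 1, the integer codes behind the labels
```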
Finally, a common type of data that needs to be represented in R are dates. Dates are typically represented in some sort of format like “DD-MM-YYYY” in the European context, for instance. Let’s say, we are given Anna’s, Jakob’s and Lisa’s birthday in different formats:
R has a specific way of recognizing such date formats using so called format strings. For each way of representing date information, there is a corresponding format string. Some important ones are:
| Format string | Description | Format string | Description |
|---|---|---|---|
| %Y | Year with century | %y | Year without century |
| %m | Month of year (01-12) | %j | Day of the year (001-366) |
| %d | Day of month (01-31) | %W | Calendar week (00-53) |
| %B | Full month (e.g. June) | %b | Abbreviated month (e.g. Jun) |
So let’s try to use this to convert our dates (which are now just character objects) to actual dates that R understands using as.Date:
Dates are stored in R as the number of days since the Unix epoch, i.e. 1 January 1970. Hence, date vectors are simply double vectors of class Date.
Note that these operations cannot be performed on the original character strings representing the birthdays:
To extract individual parts of a date (like year, month, day, weekday, etc.), we can use the format function in conjunction with the format strings we saw before for date creation:
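A sketch of the whole workflow. The three birthdays and their formats are invented for illustration; only numeric format strings are used so that the result does not depend on the language settings of the computer.

```r
# Hypothetical birthdays, each in a different format
anna  <- as.Date("23-04-1998", format = "%d-%m-%Y")
jakob <- as.Date("1996/11/05", format = "%Y/%m/%d")
lisa  <- as.Date("02.07.01",   format = "%d.%m.%y")

class(anna)    # "Date"
typeof(anna)   # "double"
as.numeric(as.Date("1970-01-02"))   # 1, i.e. one day after 1 January 1970

# Because dates are numbers underneath, we can compare and subtract them
jakob < anna   # TRUE, Jakob was born earlier
anna - jakob   # time difference in days

# Extract parts of a date with format strings
format(anna, "%Y")   # "1998"
format(lisa, "%m")   # "07"
```

Subtracting the original strings instead, e.g. `"23-04-1998" - "1996/11/05"`, would result in an error, as they are just character objects.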
Conditionals are one of the basic features of programming. They are used for what is called control flow. The most common conditional expression is the if-else statement. It’s best illustrated with an example:
[1] "Can't divide by zero!"
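A sketch of an if-else statement that would produce the message above. The variable names `x` and `result` are assumptions.

```r
x <- 0

if (x != 0) {
  # Executed only if the condition is TRUE
  result <- 1 / x
} else {
  # Executed otherwise
  result <- "Can't divide by zero!"
}

print(result)
```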
Let’s incorporate conditionals into our quadratic_solver function from before. Remember that a quadratic equation has no real solutions if the discriminant \(D\) is negative (because of the root in the “midnight” formula):
\[D = b^2 - 4ac\]
So far, when this is the case, our function produces NaNs and a warning:
Let’s fix this by computing the discriminant first and handling each case (negative, positive, zero) separately.
One solution could be the following:
quadratic_solver_new <- function(a, b, c) {
  D <- b^2 - 4 * a * c
  if (D < 0) {
    print("Discriminant is negative, no real solutions!")
    return(NA)
  } else if (D > 0) {
    print("Discriminant is positive, two real solutions!")
    sol_1 <- (-b + sqrt(D)) / (2 * a)
    sol_2 <- (-b - sqrt(D)) / (2 * a)
    return(c(sol_1, sol_2))
  } else {
    print("Discriminant is zero, one real solution!")
    sol <- -b / (2 * a)
    return(sol)
  }
}

if-else statements like the ones we saw only work on a single logical. For a vectorized version, there is the very useful ifelse function:
This function takes three arguments: a logical and two possible answers. If the logical is TRUE, the value in the second argument is returned and if FALSE, the value in the third argument is returned. When operating on vectors, ifelse takes the corresponding elements of the second or third argument.
Here’s another example:
[1] "odd" "even" "odd" "even" "odd" "even" "odd" "even" "odd" "even"
Note: x %% 2 gives the remainder when dividing x by 2.
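A sketch of two ifelse calls. The second one reproduces the "odd" / "even" output above; the first uses made-up values.

```r
# Element-wise choice between two answers
values <- c(-2, 5, 0)
signs  <- ifelse(values > 0, "positive", "not positive")
signs    # "not positive" "positive" "not positive"

# Odd or even: the remainder of division by 2 is 0 for even numbers
x <- 1:10
parity <- ifelse(x %% 2 == 0, "even", "odd")
parity
```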
In general, loops are control flow structures that enable the repeated execution of a code block as long as a specified condition is met. This saves a lot of manual work and code duplication. We will discuss two types of loops: for loops and while loops.
for loops are used to iterate over items in a vector. The logic is “for every item in this vector, do the following”. This logic is implemented in the following basic form:
Following this notation, we can refer to the element of the vector in the current loop cycle by the name item. In the first cycle, it will be the first element of the vector, in the second, it will be the second, and so on…
Let’s see an example of a for loop that prints both solutions to the quadratic equation from before in a nice format:
[1] "Discriminant is positive, two real solutions!"
[1] "Solution 1: 1.434"
[1] "Solution 2: 0.232"
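A sketch of a loop that prints the two solution lines above. The coefficients a = 3, b = -5, c = 1 are an assumption inferred from the printed values, and the solutions are computed directly here so that the example is self-contained.

```r
a <- 3
b <- -5
c <- 1
D <- b^2 - 4 * a * c
solutions <- c((-b + sqrt(D)) / (2 * a), (-b - sqrt(D)) / (2 * a))

# i takes the values 1 and 2, one per loop cycle
for (i in seq_along(solutions)) {
  print(paste0("Solution ", i, ": ", round(solutions[i], 3)))
}
```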
Note that we do not have to loop through an integer vector, we can loop through any atomic vector or list. For example, we could loop through all the elements of the list l1 from earlier and print it to the console only if it is a numeric vector:
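A sketch of such a loop. The list `l1` is re-created here from its printed form, and the counter `n_numeric` is added only for illustration.

```r
l1 <- list(1:3, "R", c(TRUE, FALSE, TRUE), c(2.3, 5.9))

n_numeric <- 0
for (element in l1) {
  # Print only the numeric elements (integer or double vectors)
  if (is.numeric(element)) {
    print(element)
    n_numeric <- n_numeric + 1
  }
}
```

This prints the first and the fourth element, as the character and the logical vector are not numeric.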
Another type of loop is a while loop. It repeats a specified action as long as a certain condition is met. It has the following basic form:
Note that for and while loops are interchangeable in every circumstance, i.e. every for loop can be implemented as a while loop and vice versa. Usually, the choice between the two is a question of code readability and efficiency considerations. Compare the following two examples:
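A sketch with made-up examples: the same task written as a for loop and as a while loop, followed by a case where a while loop is the more natural choice.

```r
# for loop: the number of cycles is known in advance
for (i in 1:3) {
  print(i)
}

# Equivalent while loop: we have to manage the counter ourselves
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}

# while loop: keep doubling until the value reaches at least 100
value <- 1
steps <- 0
while (value < 100) {
  value <- value * 2
  steps <- steps + 1
}
value   # 128
steps   # 7
```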
In R, a very commonly used alternative to loops is the use of functionals. Functionals are functions that take another function as an input and return a vector as output. Common functionals we will have a look at are lapply and apply. Let’s start with lapply.
lapply requires as arguments an atomic vector or a list and a function that it should apply to each element of that atomic vector or list. Let’s say we wanted to sort each vector in a list of vectors. We could do:
In fact, we can pass named arguments that we would usually pass to the function to be applied directly to lapply instead:
[[1]]
[1] 10 9 8 7 6 5 4 3 2 1
[[2]]
[1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
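A sketch of both lapply calls on a small made-up list (the vectors behind the output above are not shown, so `vecs` is an assumption):

```r
vecs <- list(c(3, 1, 2), c(20, 5, 10, 15))

# Apply sort to each element of the list
sorted_up <- lapply(vecs, sort)
sorted_up[[1]]     # 1 2 3

# Named arguments after the function are passed on to sort
sorted_down <- lapply(vecs, sort, decreasing = TRUE)
sorted_down[[2]]   # 20 15 10 5
```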
Instead of using lapply with a pre-existing (or user-defined) function, we can also create an inline function that exists only for the purpose of that lapply call. For example, we might want to square each vector in the list after sorting:
[[1]]
[1] 100 81 64 49 36 25 16 9 4 1
[[2]]
[1] 400 361 324 289 256 225 196 169 144 121 100 81 64 49 36 25 16 9 4
[20] 1
Such functions are called anonymous functions as they do not have a name.
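A sketch of lapply with an anonymous function, again on a made-up list `vecs`:

```r
vecs <- list(c(3, 1, 2), c(20, 5, 10, 15))

# The function is defined inline and has no name;
# v refers to the current element of the list
squared <- lapply(vecs, function(v) sort(v, decreasing = TRUE)^2)
squared[[1]]   # 9 4 1
squared[[2]]   # 400 225 100 25
```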
The matrix equivalent of lapply is called apply. It can apply a given function to every row and / or every column of a matrix. Besides requiring the matrix and the function to be applied, it requires an indication of whether application should happen over rows (1) or columns (2). Consider the following example to compute the row and column sums of a matrix:
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
[1] 12 15 18
[1] 6 15 24
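A sketch of calls that would be consistent with the matrix and the two results above (the variable names are assumptions):

```r
m <- matrix(1:9, nrow = 3)

row_sums <- apply(m, 1, sum)   # 1 = apply over rows
col_sums <- apply(m, 2, sum)   # 2 = apply over columns

row_sums   # 12 15 18
col_sums   # 6 15 24
```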
The use of functionals like lapply and apply is usually preferable over loops because it pre-specifies what should happen with the result (e.g. lapply will always hold the results in a list).
As with any software, an R package only needs to be installed once.
After installation, we need to tell R to make the functions provided by this package available to us in our current R session. For this we call the function library on the name of the package:
? operator.
Data Science and Data Analytics – R essentials